Simulation Exercise – Job Demonstration
by Barry Daemi
Wells Fargo Quantitative Analytics Program - Risk Analytics & Decision Science Track (Master's)
Southern Methodist University
November 12, 2022 $$\newcommand{\C}{\mathbb{C}} \newcommand{\R}{\mathbb{R}} \newcommand{\Z}{\mathbb{Z}} \newcommand{\Q}{\mathbb{Q}} \newcommand{\P}{\mathbb{P}} \newcommand{\F}{\mathbb{F}} \newcommand{\N}{\mathbb{N}} \newcommand{\E}{\mathbb{E}}$$
-
Task Description:
Financial institutions that lend to consumers rely on models to help decide whom to approve or decline for credit (for lending products such as credit cards, automobile loans, or home loans). In this job simulation, your task is to develop models that review credit card applications to determine which ones should be approved. You are given historical data containing one binary response and 20 predictor variables from credit card accounts for a hypothetical bank XYZ.
Introduction
A large wealth of literature is available on the topics of model selection for credit application approval [1], credit approval analysis and modeling [2,3], and model validation [2]. In consequence, there exists an over-abundance of pathways for analysis and model development for credit approval; for the purpose of completing the Task Description above, we selected only two predominant supervised machine learning classifier algorithms for the modeling aspect of the project: the logistic regression classifier and the random forest classifier.
In the first section, we perform the necessary data formatting and analysis on the provided dataset to ready it for modeling. In the second section, we develop the theoretical underpinnings of logistic regression (framed as a single-layer neural network); rather than developing our own proprietary software, we rely on the open-source logistic regression implementation from sklearn, and we follow the model's results with statistical-inference commentary. In the third section, we cover the theoretical framework for random forest and implement that model through sklearn as well, again following the results with inference-based commentary. Lastly, in the fourth section, we discuss model selection and model use in real-world credit approval applications.
For the purpose of model replication, we used the following code block for our package imports, adding print statements to display in the console the version of each package that was utilized.
import numpy as np
import scipy.special as sp
import pandas as pd
import statsmodels.api as sm
import seaborn
import matplotlib
import matplotlib.pyplot as plt
import sklearn as sk
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
print("Numpy " + str(np.__version__));
print("Pandas " + str(pd.__version__));
print("Seaborn "+str(seaborn.__version__));
print("Matplotlib.pyplot "+str(matplotlib.__version__));
print("Sklearn "+str(sk.__version__));
Numpy 1.21.5
Pandas 1.4.2
Seaborn 0.11.2
Matplotlib.pyplot 3.5.1
Sklearn 1.0.2
Just for a touch of creativity, we present the Task Description inside a centered gray box. Creating said gray block required a snippet of CSS. Since Markdown compiles into HTML, and Jupyter Notebook can render raw HTML, we can pass the following CSS to the notebook's HTML renderer, which applies the styles when the cell is rendered [4]. An additional note: the Markdown text blocks in this document contain straight HTML code rather than Markdown. The purpose was to retain the full functionality of HTML5 as a markup language, which permits heavy customization of text and thereby furthers the aesthetic of the document.
%%html
<style>#toc_container {
background: #f9f9f9 none repeat scroll 0 0;
border: 1px solid #aaa;
display: table;
font-size: 85%;
margin-bottom: 1em;
padding: 20px;
width: auto;
}
.toc_title {
font-weight: 700;
text-align: center;
}
#toc_container li, #toc_container ul, #toc_container ul li{
list-style: outside none none !important;
}</style>
<style>
table, th, td {
border:1px solid black;
}
</style>
Section 1: Data Analysis
We imported the training dataset, Training_R-197135_Candidate Attach #1_JDSE_SRF #456.csv.csv, through the pandas.read_csv function and named the resulting dataframe train_df. For the reader's convenience, we print train_df to the console so that the resulting dataset can be observed.
train_df=pd.read_csv("Training_R-197135_Candidate Attach #1_JDSE_SRF #456.csv.csv")
train_df
| tot_balance | avg_bal_cards | credit_age | credit_age_good_account | credit_card_age | num_acc_30d_past_due_12_months | num_acc_30d_past_due_6_months | num_mortgage_currently_past_due | tot_amount_currently_past_due | num_inq_12_month | ... | num_card_12_month | num_auto_ 36_month | uti_open_card | pct_over_50_uti | uti_max_credit_line | pct_card_over_50_uti | ind_XYZ | rep_income | rep_education | Def_ind | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 102956.11010 | 14819.057400 | 238 | 104 | 264 | 0 | 0 | 0 | 0.000000 | 0 | ... | 1 | 0 | 0.366737 | 0.342183 | 0.513934 | 0.550866 | 0 | 118266.32130 | college | 0 |
| 1 | 132758.72580 | 18951.934550 | 384 | 197 | 371 | 0 | 0 | 0 | 0.000000 | 0 | ... | 0 | 0 | 0.490809 | 0.540671 | 0.418016 | NaN | 0 | 89365.05765 | college | 0 |
| 2 | 124658.91740 | 15347.929690 | 277 | 110 | 288 | 0 | 0 | 0 | 0.000000 | 0 | ... | 0 | 0 | 0.359074 | 0.338560 | 0.341627 | 0.451417 | 0 | 201365.12130 | college | 0 |
| 3 | 133968.53690 | 14050.713340 | 375 | 224 | 343 | 0 | 0 | 0 | 0.000000 | 2 | ... | 1 | 0 | 0.700379 | 0.683589 | 0.542940 | 0.607843 | 0 | 191794.48550 | college | 0 |
| 4 | 143601.80170 | 14858.515270 | 374 | 155 | 278 | 0 | 0 | 0 | 0.000000 | 0 | ... | 0 | 0 | 0.647351 | 0.510812 | 0.632934 | 0.573680 | 0 | 161465.36790 | graduate | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19995 | 89665.13930 | 11521.159950 | 319 | 139 | 363 | 0 | 0 | 0 | 0.000000 | 0 | ... | 0 | 0 | 0.535628 | 0.634712 | 0.527230 | 0.602345 | 0 | NaN | high_school | 0 |
| 19996 | 136211.63530 | 17977.054130 | 297 | 137 | 273 | 0 | 0 | 0 | 0.000000 | 2 | ... | 0 | 0 | 0.464774 | 0.450030 | 0.545108 | NaN | 1 | NaN | high_school | 0 |
| 19997 | 110721.87650 | 13316.820540 | 304 | 151 | 257 | 0 | 0 | 0 | 0.000000 | 0 | ... | 0 | 0 | 0.264544 | 0.340289 | 0.412155 | NaN | 0 | 157706.15810 | college | 0 |
| 19998 | 96742.36371 | 11743.262370 | 275 | 141 | 294 | 2 | 1 | 1 | 3009.387661 | 0 | ... | 0 | 0 | 0.609226 | 0.582007 | 0.301612 | 0.697052 | 1 | 97387.97414 | college | 1 |
| 19999 | 107338.82070 | 7942.952546 | 325 | 195 | 302 | 0 | 0 | 0 | 0.000000 | 0 | ... | 0 | 0 | 0.358067 | 0.435511 | 0.349246 | NaN | 0 | 165447.16380 | college | 0 |
20000 rows × 21 columns
The resulting dataset, train_df, possessed twenty thousand observations and twenty-one variables: twenty predictor features and one target variable, Def_ind. We created the following reference table with the twenty predictor features, each accompanied by its description.
| Variable | Description |
|---|---|
| tot_balance | Total balance |
| avg_bal_cards | Average balance over all active cards |
| credit_age | Age in months of first credit product |
| credit_age_good_account | Age in months of oldest credit product obtained |
| credit_card_age | Age in months of applicant's oldest credit card |
| num_acc_30d_past_due_12_months | Number of accounts that are 30 or more days delinquent within last 12 months |
| num_acc_30d_past_due_6_months | Number of accounts that are 30 or more days delinquent within last 6 months |
| num_mortgage_currently_past_due | Number of mortgages delinquent in last 6 months |
| tot_amount_currently_past_due | Total amount currently past due for all credit accounts |
| num_inq_12_month | Number of inquiries in last 12 months |
| num_card_inq_24_month | Number of credit card inquiries in last 24 months |
| num_card_12_month | Number of credit cards opened in last 12 months |
| num_auto_ 36_month | Number of auto loans opened in last 36 months |
| uti_open_card | Utilization on open credit card accounts |
| pct_over_50_uti | Percentage of open accounts with over 50% utilization |
| uti_max_credit_line | Utilization on credit account with highest credit limit |
| pct_card_over_50_uti | Percentage of open credit cards with over 50% utilization |
| ind_XYZ | Indicator: 1 if applicant already has some account (checking/savings, etc.) with the bank XYZ; 0 otherwise |
| rep_income | Annual income (self-reported by applicant and not verified) |
| rep_education | Education level (self-reported by applicant and not verified); four levels: high school or below, college degree, graduate degree, other |
We also created a reference table for the target variable Def_ind.

| Variable | Description |
|---|---|
| Def_ind | Binary: 1 = account defaulted after an account was approved and opened with bank XYZ in the past 18 months; 0 = not defaulted |
These variable descriptions were sourced directly from the Task Description documentation provided; we hope these quick-reference tables prove useful to the reader.
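Before treating missing values, we also find it worth checking how imbalanced the binary target is, since default data typically contains far fewer defaults than non-defaults, which affects how accuracy should be interpreted later. The sketch below is ours and uses a hypothetical ten-row stand-in for train_df, not the actual data.

```python
import pandas as pd

# Hypothetical miniature stand-in for train_df (the real frame has 20,000 rows).
train_df = pd.DataFrame({"Def_ind": [0, 0, 0, 0, 1, 0, 1, 0, 0, 0]})

# Class balance of the binary target; a low default rate means raw accuracy
# alone can be a misleading performance metric for the classifiers trained later.
counts = train_df["Def_ind"].value_counts()
default_rate = counts.get(1, 0) / len(train_df)
print(counts)
print(f"Default rate: {default_rate:.1%}")
```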
Through train_df.isna().any(), we found that the columns pct_card_over_50_uti, rep_income, and rep_education all contained missing data. With further inquiry through train_df.isna().sum(), we discerned that pct_card_over_50_uti had $1,958$ missing values, rep_income had $1,559$ missing values, and rep_education had only one missing value. This discernment is important, as observations with missing data left in the training dataset would bias the training of either a logistic regression or a random forest model. To prevent this, we deleted the observations with missing values [5]. Though deletion of observations with missing data is the simplest and most convenient approach, it can itself introduce bias into the model when the missing values are not randomly distributed but instead deterministic [6].
> Deletion: In this approach all entries with missing values are removed/discarded when doing analysis. Deletion is considered the simplest approach as there is no need to try and estimate values. However, Little and Rubin [18] have demonstrated some of the weaknesses of deletion, as it introduces bias in analysis, especially when the missing data is not randomly distributed. The process of deletion can be carried out in two ways, pairwise or list-wise deletion [32].
Fortunately, each of the columns that possessed missing values were sampled random variables, and therefore the missing data is randomly distributed. On an additional note, it would be remiss of us not to mention that the purpose of Emmanuel et al. (2021) [6] was the development of algorithms, "k nearest neighbor and an iterative imputation method (missForest)" [6], derived from the random forest algorithm, which handle missing data with far better success than conventional deletion; the implementation of these algorithms is outside the scope of this project, though we encourage the reader to read the article, as it seems promising from a research perspective.
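As a rough illustration of the imputation alternative mentioned above, sklearn ships a k-nearest-neighbour imputer. The four-row frame below is a toy stand-in of ours for two numeric columns of train_df, not the actual data; it only sketches the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy stand-in for two numeric columns of train_df, with one missing value.
df = pd.DataFrame({
    "pct_card_over_50_uti": [0.55, np.nan, 0.45, 0.61],
    "uti_open_card":        [0.37, 0.49, 0.36, 0.70],
})

# Each missing entry is replaced by the mean of that feature over the
# k most similar rows (here k=2), instead of dropping the whole row.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```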
Though the theory behind duplicated data is not yet settled, sources such as [7] suggest that duplicated data can contribute to a model's variance; in other words, duplicates can cause overfitting by over-emphasizing repeated observations over novel data during training. Overfitting, in mathematical terms, is a "loss of generality": a fitted solution applies only to a unique set of cases, in this context a unique set of datasets, and with said loss of generality the model's predictive capacity is limited to the data it was fitted on. As this is not the desired result, we checked whether duplicated data existed through train_df2.duplicated(); fortunately, all of the observations were found to be unique.
We implemented train_df.dtypes as a way to observe the data type of each column; we found that rep_education was an object column, which neither logistic regression nor random forest is able to handle directly. As a result, we converted rep_education from an object column to an ordinal numeric column through a nested if/elif chain. As no specification was given regarding reported educational attainment, we decided by assumption to encode 'other' as $1$, 'high_school' as $2$, 'college' as $3$, and 'graduate' as $4$. In essence, we assumed that educational attainment is tiered, ordered as other $<$ high school $<$ college $<$ graduate.
After the following data formatting steps, the dataset's size changed from (20000, 21) to (16653, 21), and the result was named train_df2.
print(train_df.isna().any()); # Checks for missing data in each column
print(" ");
print(train_df.isna().sum()); # The number of missing data in each column
print(" ");
print(train_df.dtypes)
print(" ")
train_df2=train_df.dropna().copy(); # .copy() so the later column assignment does not trigger SettingWithCopyWarning
a=list(train_df2['rep_education'])
b=[];
for i in range(len(a)):
    if a[i]=='other': # the data stores this level in lower case, so 'Other' would never match
        b.append(1)
    elif a[i]=='high_school':
        b.append(2)
    elif a[i]=='college':
        b.append(3)
    elif a[i]=='graduate':
        b.append(4)
    else:
        b.append(0);
train_df2['rep_education']=b;
print('Number of duplicate accounts: '+str(sum(train_df2.duplicated()))); # Number of duplicate obs.
print(" ");
print("Size of train_df: " +str(train_df.shape));
print("Size of train_df2 "+str(train_df2.shape));
| Column | Missing? | # Missing | dtype |
|---|---|---|---|
| tot_balance | False | 0 | float64 |
| avg_bal_cards | False | 0 | float64 |
| credit_age | False | 0 | int64 |
| credit_age_good_account | False | 0 | int64 |
| credit_card_age | False | 0 | int64 |
| num_acc_30d_past_due_12_months | False | 0 | int64 |
| num_acc_30d_past_due_6_months | False | 0 | int64 |
| num_mortgage_currently_past_due | False | 0 | int64 |
| tot_amount_currently_past_due | False | 0 | float64 |
| num_inq_12_month | False | 0 | int64 |
| num_card_inq_24_month | False | 0 | int64 |
| num_card_12_month | False | 0 | int64 |
| num_auto_ 36_month | False | 0 | int64 |
| uti_open_card | False | 0 | float64 |
| pct_over_50_uti | False | 0 | float64 |
| uti_max_credit_line | False | 0 | float64 |
| pct_card_over_50_uti | True | 1958 | float64 |
| ind_XYZ | False | 0 | int64 |
| rep_income | True | 1559 | float64 |
| rep_education | True | 1 | object |
| Def_ind | False | 0 | int64 |

Number of duplicate accounts: 0
Size of train_df: (20000, 21)
Size of train_df2 (16653, 21)
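The encoding loop above can also be written as a single Series.map. The sketch below is our alternative, not the original notebook code; it uses a toy frame in place of train_df, and an explicit .copy() so that pandas raises no SettingWithCopyWarning.

```python
import pandas as pd

# Toy stand-in for train_df; only the column being encoded is included.
train_df = pd.DataFrame(
    {"rep_education": ["other", "high_school", "college", "graduate", "college"]}
)

# Same ordinal assumption as above: other < high_school < college < graduate.
order = {"other": 1, "high_school": 2, "college": 3, "graduate": 4}
train_df2 = train_df.dropna().copy()  # .copy() detaches the slice from train_df
train_df2["rep_education"] = train_df2["rep_education"].map(order)
print(train_df2["rep_education"].tolist())
```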
To best summarize the relationships between predictor variables, we used seaborn.pairplot(train_df2), which produced the following graphic. Note that double-clicking the graphic activates a zoom, so that the graphic becomes legible.
g=seaborn.pairplot(train_df2);
The following code block attains the correlation matrix, $\text{Corr}$, of the dataset train_df2 and draws a heatmap of the correlation between each pair of predictor features; a mask hides the upper triangle, including the meaningless self-correlation values found on the diagonal.
corr=train_df2.corr(); # Correlation matrix
mask=np.zeros_like(corr,dtype=bool); # Generate a mask for the upper triangle
mask[np.triu_indices_from(mask)]=True;
f,ax =plt.subplots(figsize=(11,9)); # Set up the matplotlib figure
# Generate a custom diverging colormap
cmap=seaborn.diverging_palette(220,10,as_cmap=True);
# Draw the heatmap with the mask and correct aspect ratio
seaborn.heatmap(corr,mask=mask,cmap=cmap,vmax=0.5,
linewidths=0.5,cbar_kws={"shrink": 0.5},ax=ax,);
corr
| tot_balance | avg_bal_cards | credit_age | credit_age_good_account | credit_card_age | num_acc_30d_past_due_12_months | num_acc_30d_past_due_6_months | num_mortgage_currently_past_due | tot_amount_currently_past_due | num_inq_12_month | ... | num_card_12_month | num_auto_ 36_month | uti_open_card | pct_over_50_uti | uti_max_credit_line | pct_card_over_50_uti | ind_XYZ | rep_income | rep_education | Def_ind | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| tot_balance | 1.000000 | 0.706008 | 0.018683 | 0.008407 | 0.017945 | -0.025365 | -0.014801 | -0.020003 | -0.017746 | -0.018269 | ... | -0.015882 | -0.000527 | -0.025588 | -0.028925 | -0.013396 | -0.019086 | -0.004480 | 0.002867 | 0.010655 | -0.090389 |
| avg_bal_cards | 0.706008 | 1.000000 | 0.012409 | 0.008274 | 0.010635 | -0.017284 | -0.006276 | -0.015438 | -0.009335 | -0.010216 | ... | -0.014330 | 0.001825 | -0.027921 | -0.027622 | -0.021695 | -0.020415 | 0.000823 | -0.001124 | 0.006176 | -0.112316 |
| credit_age | 0.018683 | 0.012409 | 1.000000 | 0.799485 | 0.851878 | -0.033706 | -0.021219 | -0.018264 | -0.028571 | -0.024310 | ... | -0.010588 | 0.001013 | -0.046747 | -0.039747 | -0.035542 | -0.041107 | 0.008604 | 0.014881 | 0.026740 | -0.101712 |
| credit_age_good_account | 0.008407 | 0.008274 | 0.799485 | 1.000000 | 0.676699 | -0.031891 | -0.024850 | -0.022256 | -0.029034 | -0.017317 | ... | -0.013260 | 0.001666 | -0.033019 | -0.030550 | -0.027794 | -0.028352 | 0.010351 | 0.006632 | 0.024705 | -0.080229 |
| credit_card_age | 0.017945 | 0.010635 | 0.851878 | 0.676699 | 1.000000 | -0.025152 | -0.014529 | -0.014072 | -0.021689 | -0.021692 | ... | -0.002793 | -0.000250 | -0.048297 | -0.039013 | -0.041591 | -0.039819 | 0.007888 | 0.015414 | 0.020083 | -0.087758 |
| num_acc_30d_past_due_12_months | -0.025365 | -0.017284 | -0.033706 | -0.031891 | -0.025152 | 1.000000 | 0.710836 | 0.730372 | 0.807057 | 0.037867 | ... | 0.018590 | 0.002128 | 0.057994 | 0.042795 | 0.037550 | 0.052665 | -0.023432 | -0.003573 | -0.023003 | 0.278412 |
| num_acc_30d_past_due_6_months | -0.014801 | -0.006276 | -0.021219 | -0.024850 | -0.014529 | 0.710836 | 1.000000 | 0.740790 | 0.778626 | 0.034995 | ... | 0.007625 | 0.005726 | 0.031946 | 0.024165 | 0.020911 | 0.022819 | -0.007584 | -0.009195 | -0.011348 | 0.242955 |
| num_mortgage_currently_past_due | -0.020003 | -0.015438 | -0.018264 | -0.022256 | -0.014072 | 0.730372 | 0.740790 | 1.000000 | 0.767837 | 0.030150 | ... | 0.016962 | 0.002660 | 0.038531 | 0.031204 | 0.023445 | 0.033175 | -0.013731 | -0.016240 | -0.016327 | 0.247359 |
| tot_amount_currently_past_due | -0.017746 | -0.009335 | -0.028571 | -0.029034 | -0.021689 | 0.807057 | 0.778626 | 0.767837 | 1.000000 | 0.034072 | ... | 0.008523 | 0.001563 | 0.037530 | 0.027323 | 0.024718 | 0.030772 | -0.010888 | -0.004745 | -0.019133 | 0.258291 |
| num_inq_12_month | -0.018269 | -0.010216 | -0.024310 | -0.017317 | -0.021692 | 0.037867 | 0.034995 | 0.030150 | 0.034072 | 1.000000 | ... | 0.017702 | 0.004177 | 0.040801 | 0.030027 | 0.036274 | 0.037665 | -0.037070 | -0.005109 | -0.024837 | 0.130904 |
| num_card_inq_24_month | -0.011203 | -0.007526 | -0.020134 | -0.016543 | -0.016279 | 0.037378 | 0.035194 | 0.032946 | 0.036066 | 0.901963 | ... | 0.008290 | 0.008430 | 0.037722 | 0.023515 | 0.030282 | 0.033351 | -0.039109 | -0.002521 | -0.022003 | 0.115971 |
| num_card_12_month | -0.015882 | -0.014330 | -0.010588 | -0.013260 | -0.002793 | 0.018590 | 0.007625 | 0.016962 | 0.008523 | 0.017702 | ... | 1.000000 | 0.110687 | 0.003879 | 0.003308 | -0.000449 | 0.010787 | -0.009833 | -0.022956 | 0.002878 | 0.028948 |
| num_auto_ 36_month | -0.000527 | 0.001825 | 0.001013 | 0.001666 | -0.000250 | 0.002128 | 0.005726 | 0.002660 | 0.001563 | 0.004177 | ... | 0.110687 | 1.000000 | -0.009198 | -0.008404 | -0.009326 | -0.002606 | -0.000397 | -0.016421 | -0.008177 | 0.006338 |
| uti_open_card | -0.025588 | -0.027921 | -0.046747 | -0.033019 | -0.048297 | 0.057994 | 0.031946 | 0.038531 | 0.037530 | 0.040801 | ... | 0.003879 | -0.009198 | 1.000000 | 0.749143 | 0.749256 | 0.846833 | -0.021142 | -0.004465 | -0.032851 | 0.209379 |
| pct_over_50_uti | -0.028925 | -0.027622 | -0.039747 | -0.030550 | -0.039013 | 0.042795 | 0.024165 | 0.031204 | 0.027323 | 0.030027 | ... | 0.003308 | -0.008404 | 0.749143 | 1.000000 | 0.566553 | 0.630853 | -0.019144 | -0.003434 | -0.027878 | 0.168094 |
| uti_max_credit_line | -0.013396 | -0.021695 | -0.035542 | -0.027794 | -0.041591 | 0.037550 | 0.020911 | 0.023445 | 0.024718 | 0.036274 | ... | -0.000449 | -0.009326 | 0.749256 | 0.566553 | 1.000000 | 0.634889 | -0.016914 | -0.004515 | -0.022459 | 0.158442 |
| pct_card_over_50_uti | -0.019086 | -0.020415 | -0.041107 | -0.028352 | -0.039819 | 0.052665 | 0.022819 | 0.033175 | 0.030772 | 0.037665 | ... | 0.010787 | -0.002606 | 0.846833 | 0.630853 | 0.634889 | 1.000000 | -0.016418 | -0.000885 | -0.033578 | 0.174590 |
| ind_XYZ | -0.004480 | 0.000823 | 0.008604 | 0.010351 | 0.007888 | -0.023432 | -0.007584 | -0.013731 | -0.010888 | -0.037070 | ... | -0.009833 | -0.000397 | -0.021142 | -0.019144 | -0.016914 | -0.016418 | 1.000000 | 0.006390 | 0.017410 | -0.040863 |
| rep_income | 0.002867 | -0.001124 | 0.014881 | 0.006632 | 0.015414 | -0.003573 | -0.009195 | -0.016240 | -0.004745 | -0.005109 | ... | -0.022956 | -0.016421 | -0.004465 | -0.003434 | -0.004515 | -0.000885 | 0.006390 | 1.000000 | 0.013797 | -0.000740 |
| rep_education | 0.010655 | 0.006176 | 0.026740 | 0.024705 | 0.020083 | -0.023003 | -0.011348 | -0.016327 | -0.019133 | -0.024837 | ... | 0.002878 | -0.008177 | -0.032851 | -0.027878 | -0.022459 | -0.033578 | 0.017410 | 0.013797 | 1.000000 | -0.030515 |
| Def_ind | -0.090389 | -0.112316 | -0.101712 | -0.080229 | -0.087758 | 0.278412 | 0.242955 | 0.247359 | 0.258291 | 0.130904 | ... | 0.028948 | 0.006338 | 0.209379 | 0.168094 | 0.158442 | 0.174590 | -0.040863 | -0.000740 | -0.030515 | 1.000000 |
21 rows × 21 columns
Certain pairs of predictor features possessed strong positive correlation with each other, such as tot_balance and avg_bal_cards (0.706), which entirely makes sense. Though these relationships are interesting, we are more concerned with the relationships between the predictor features and the target variable Def_ind. The following predictor features possessed a weak positive correlation with the default indicator Def_ind: num_acc_30d_past_due_12_months, num_acc_30d_past_due_6_months, num_mortgage_currently_past_due, tot_amount_currently_past_due, and num_card_12_month. These predictor features possessed a weak negative correlation with the default indicator: avg_bal_cards and credit_age. These observations will be important later in Section 5: How Model Improves Decision Making?.
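The reading of the heatmap above can also be reproduced programmatically by ranking predictors by their absolute correlation with Def_ind. The frame below is a three-column toy stand-in of ours for train_df2, with made-up values, used only to demonstrate the ranking step.

```python
import pandas as pd

# Toy stand-in for train_df2: one delinquency feature, one balance feature,
# and the binary target.
df = pd.DataFrame({
    "num_acc_30d_past_due_12_months": [0, 0, 2, 1, 0, 3],
    "avg_bal_cards": [15000, 18000, 9000, 11000, 17000, 8000],
    "Def_ind": [0, 0, 1, 1, 0, 1],
})

# Rank predictors by |corr| with the target, dropping the self-correlation.
target_corr = df.corr()["Def_ind"].drop("Def_ind")
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```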
We imported the test dataset, Test_R-197135_Candidate Attach #2_JDSE_SRF #456.csv, through the pandas.read_csv function and named the resulting test dataframe test_df. For the reader's convenience, we print test_df to the console so that the resulting dataset can be observed.
test_df=pd.read_csv("Test_R-197135_Candidate Attach #2_JDSE_SRF #456.csv")
test_df
| tot_balance | avg_bal_cards | credit_age | credit_age_good_account | credit_card_age | num_acc_30d_past_due_12_months | num_acc_30d_past_due_6_months | num_mortgage_currently_past_due | tot_amount_currently_past_due | num_inq_12_month | ... | num_card_12_month | num_auto_ 36_month | uti_open_card | pct_over_50_uti | uti_max_credit_line | pct_card_over_50_uti | ind_XYZ | rep_income | rep_education | Def_ind | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75061.45088 | 11051.42462 | 191 | 103 | 220 | 0 | 0 | 0 | 0.000000 | 0 | ... | 0 | 0 | 0.417116 | 0.490809 | 0.400379 | 0.429427 | 1 | 200321.9635 | high_school | 0 |
| 1 | 89792.74848 | 13839.37518 | 140 | 145 | 152 | 1 | 0 | 0 | 0.000000 | 0 | ... | 0 | 1 | 0.472116 | 0.505581 | 0.655517 | 0.501279 | 0 | 168452.9762 | high_school | 0 |
| 2 | 95928.23392 | 10437.19476 | 343 | 220 | 388 | 2 | 0 | 0 | 19530.997450 | 0 | ... | 0 | 1 | 0.394099 | 0.551539 | 0.309663 | 0.482915 | 1 | 190633.9622 | other | 0 |
| 3 | 124957.43040 | 17413.10572 | 232 | 97 | 235 | 0 | 0 | 0 | 0.000000 | 0 | ... | 2 | 1 | 0.492846 | 0.540109 | 0.590457 | 0.466224 | 1 | 106712.5622 | high_school | 0 |
| 4 | 75058.13462 | 12326.23680 | 236 | 165 | 280 | 0 | 0 | 0 | 0.000000 | 0 | ... | 1 | 1 | 0.381452 | 0.344772 | 0.526555 | 0.345455 | 0 | 173172.1864 | college | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | 62927.10171 | 16602.57606 | 271 | 127 | 291 | 0 | 0 | 0 | 887.199204 | 0 | ... | 1 | 0 | 0.396578 | 0.519155 | 0.301686 | NaN | 1 | 151388.6128 | college | 0 |
| 4996 | 98348.24852 | 16093.02091 | 141 | 117 | 184 | 0 | 0 | 0 | 0.000000 | 0 | ... | 1 | 0 | 0.447068 | 0.523186 | 0.426136 | 0.509175 | 0 | 105431.9853 | college | 0 |
| 4997 | 49262.82310 | 10029.11290 | 321 | 144 | 367 | 0 | 0 | 0 | 0.000000 | 6 | ... | 0 | 0 | 0.476359 | 0.449276 | 0.524794 | 0.578619 | 0 | 110293.0904 | college | 0 |
| 4998 | 116989.85900 | 15803.98673 | 282 | 153 | 275 | 0 | 0 | 0 | 0.000000 | 1 | ... | 0 | 0 | 0.379345 | 0.505389 | 0.401324 | 0.497443 | 1 | 140715.4635 | college | 0 |
| 4999 | 106157.16290 | 15439.35593 | 182 | 68 | 185 | 0 | 0 | 0 | 0.000000 | 0 | ... | 1 | 0 | 0.406199 | 0.479625 | 0.424671 | 0.396460 | 0 | 140042.8599 | college | 0 |
5000 rows × 21 columns
test_df is the test dataset; it contains the same predictor features as train_df and five thousand observations. To ready the dataset for testing the trained logistic regression and random forest models, we repeated on it the formatting process that we conducted on the training dataset.
An additional note regarding missing data: as neither trained model can handle missing data, we again had to delete the observations with missing values.
After the following data formatting steps, the test dataset's size changed from (5000, 21) to (4133, 21), and the result was named test_df2.
print(test_df.isna().any()); # Checks for missing data in each column
print(" ");
print(test_df.isna().sum()); # The number of missing data in each column
print(" ");
print(test_df.dtypes)
print(" ")
test_df2=test_df.dropna().copy(); # .copy() so the later column assignment does not trigger SettingWithCopyWarning
a=list(test_df2['rep_education'])
b=[];
for i in range(len(a)):
    if a[i]=='other': # the data stores this level in lower case, so 'Other' would never match
        b.append(1)
    elif a[i]=='high_school':
        b.append(2)
    elif a[i]=='college':
        b.append(3)
    elif a[i]=='graduate':
        b.append(4)
    else:
        b.append(0);
test_df2['rep_education']=b;
print('Number of duplicate accounts: '+str(sum(test_df2.duplicated()))); # Number of duplicate obs.
print(" ");
print("Size of test_df: " +str(test_df.shape));
print("Size of test_df2: "+str(test_df2.shape));
| Column | Missing? | # Missing | dtype |
|---|---|---|---|
| tot_balance | False | 0 | float64 |
| avg_bal_cards | False | 0 | float64 |
| credit_age | False | 0 | int64 |
| credit_age_good_account | False | 0 | int64 |
| credit_card_age | False | 0 | int64 |
| num_acc_30d_past_due_12_months | False | 0 | int64 |
| num_acc_30d_past_due_6_months | False | 0 | int64 |
| num_mortgage_currently_past_due | False | 0 | int64 |
| tot_amount_currently_past_due | False | 0 | float64 |
| num_inq_12_month | False | 0 | int64 |
| num_card_inq_24_month | False | 0 | int64 |
| num_card_12_month | False | 0 | int64 |
| num_auto_ 36_month | False | 0 | int64 |
| uti_open_card | False | 0 | float64 |
| pct_over_50_uti | False | 0 | float64 |
| uti_max_credit_line | False | 0 | float64 |
| pct_card_over_50_uti | True | 489 | float64 |
| ind_XYZ | False | 0 | int64 |
| rep_income | True | 410 | float64 |
| rep_education | True | 4 | object |
| Def_ind | False | 0 | int64 |

Number of duplicate accounts: 0
Size of test_df: (5000, 21)
Size of test_df2: (4133, 21)
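Since the train and test frames receive identical treatment, the cleaning steps above could be factored into one helper. The function name clean_frame below is a hypothetical choice of ours, not from the original notebook, and it is demonstrated on a tiny made-up frame rather than the real data.

```python
import pandas as pd

# Assumed ordinal encoding, matching the assumption stated in Section 1.
EDU_ORDER = {"other": 1, "high_school": 2, "college": 3, "graduate": 4}

def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing values and ordinally encode rep_education."""
    out = df.dropna().copy()
    out["rep_education"] = out["rep_education"].map(EDU_ORDER)
    return out

# Tiny made-up frame to demonstrate the helper; the row with None is dropped.
demo = pd.DataFrame({
    "rep_education": ["college", None, "graduate"],
    "rep_income": [100000.0, 90000.0, 80000.0],
})
cleaned = clean_frame(demo)
print(cleaned)
```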
Section 2: Logistic Regression
Logistic regression is a supervised machine learning classification algorithm for estimating the conditional probability $P(Y|X)$, where the matrix $X$ is a dataset of features and $Y$ is a binomially distributed target random variable [8]. Peeling back any neural network, its core is the loss function (or cost function); the loss function is what is used to attain a gradient (i.e., a vector derivative) for the gradient descent algorithm, which optimizes the weight matrices during the backward-propagation step of the network's training.
As the loss function will be important for explaining regularization, we construct the logistic regression loss function here. Suppose $Y$ is a Bernoulli distributed random variable and the matrix $X$ is a collection of independent continuous or discrete random variables, with no need for homoscedasticity (in other words, similar variances). For the conditional probability $P(Y=y\mid X=x)$, we have
$$P(Y=y\mid X=x)=p^{y}(1-p)^{1-y},$$
though we do not know the value of $p$, the probability of $y$ being $1$. We can remedy this with an approximation of $p$: suppose $\beta$ is a weight vector and $\sigma(z)$ is the sigmoid function; then we can derive the following,
$$z=\beta_{0}+\sum^{n}_{i=1} \beta_{i} x_{i} = \beta^{T}x \to P(Y=1\mid X=x)=\sigma(z)=\frac{1}{1+e^{-z}}.$$
Plugging the approximation of $p$ into the Bernoulli distribution formula, we attain the estimated Bernoulli formula,
$$P(Y=y\mid X=x)=\sigma(\beta^{T}x)^{y}(1-\sigma(\beta^{T}x))^{1-y},$$
and by calculating the maximum likelihood estimate (MLE) of said estimated Bernoulli formulation, we are able to attain the loss function of logistic regression.
$$L(\beta)=\prod_{i=1}^{n} P(Y=y_{i}|X=x_{i})=\prod_{i=1}^{n} \sigma(\beta^{T}x_{i})^{y_{i}}(1-\sigma(\beta^{T}x_{i}))^{1-y_{i}}$$
$$ \to \ln(L(\beta))= \ln\bigg( \prod_{i=1}^{n} \sigma(\beta^{T}x_{i})^{y_{i}}(1-\sigma(\beta^{T}x_{i}))^{1-y_{i}}\bigg)$$
$$ \to LL(\beta)= \sum_{i=1}^{n} \ln\bigg( \sigma(\beta^{T}x_{i})^{y_{i}}(1-\sigma(\beta^{T}x_{i}))^{1-y_{i}} \bigg)$$
$$ \to LL(\beta)= \sum_{i=1}^{n} y_{i}\ln(\sigma(\beta^{T}x_{i}))+(1-y_{i})\ln(1-\sigma(\beta^{T}x_{i})). \square$$
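To make the derivation concrete, here is a minimal numerical sketch of evaluating $LL(\beta)$ on a small hypothetical toy dataset (not the bank XYZ data):

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, X, y):
    """LL(beta) = sum_i y_i ln(sigma(beta^T x_i)) + (1 - y_i) ln(1 - sigma(beta^T x_i))."""
    p = sigmoid(X @ beta)                      # sigma(beta^T x_i) for every row of X
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical toy data: 4 observations, 2 features.
X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0], [-1.0, 1.5]])
y = np.array([1, 0, 1, 0])

# With beta = 0, every p equals 0.5, so LL = n * ln(0.5).
print(log_likelihood(np.zeros(2), X, y))       # -> 4 * ln(0.5) ≈ -2.7726
```

Note that with all weights zero, the model is maximally uncertain ($p=0.5$ for every observation), which gives the baseline value that gradient-based training then improves upon.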
This log-likelihood yields the loss function of logistic regression; to attain the optimum (i.e., desired) beta weight vector, we maximize $LL(\beta)$, or equivalently minimize the negative log-likelihood. The most conventional approach is a gradient descent algorithm [8], whose gradient is defined by taking the partial derivative of the loss function with respect to each beta weight, $\frac{\partial LL(\beta)}{\partial \beta_{i}}$. To guarantee convergence to a global optimum by gradient descent, the problem must be convex (e.g., the Hessian symmetric positive definite), with a learning rate less than $1$. This is because the step taken along the gradient must remain small enough when sufficiently close to the minimum (i.e., the bottom of the valley); otherwise gradient descent will diverge or become stuck insufficiently close to the minimum, returning the wrong beta weight vector [9]. These details are important for understanding how regularization addresses overfitting in the logistic regression model; we discuss regularization further after the following analysis.
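As an illustration of the optimization step (a sketch under simplifying assumptions, not the sklearn solver used below), gradient ascent on $LL(\beta)$ uses the gradient $X^{T}(y - \sigma(X\beta))$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, lr=0.1, n_iter=5000):
    """Gradient-ascent sketch for maximizing LL(beta).

    The gradient of LL(beta) is X^T (y - sigma(X beta)); ascending it maximizes
    the likelihood (equivalently, descending the negative log-likelihood).
    `lr` is the learning rate discussed above.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (y - sigmoid(X @ beta))
        beta += lr * grad / len(y)             # averaged gradient step
    return beta

# Hypothetical separable toy data: the first column acts as an intercept term.
X = np.array([[1.0, 2.0], [1.0, 1.0], [1.0, -1.0], [1.0, -2.0]])
y = np.array([1, 1, 0, 0])
beta = fit_logistic_gd(X, y)
print((sigmoid(X @ beta) > 0.5).astype(int))   # recovers the labels [1 1 0 0]
```

On separable data like this the unregularized weights keep growing with more iterations, which previews why a regularization term is useful.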
We separated the predictor features from train_df2 into a dataframe named train_X to form the matrix $X$, and the target variable into a dataframe named train_Y. We performed the same process for test_df2: the predictor features went into test_X, and the target variable into test_Y. We were then able to apply the logistic regression implementation from sklearn.linear_model for our analysis.
list(train_df2.columns[0:20])
['tot_balance', 'avg_bal_cards', 'credit_age', 'credit_age_good_account', 'credit_card_age', 'num_acc_30d_past_due_12_months', 'num_acc_30d_past_due_6_months', 'num_mortgage_currently_past_due', 'tot_amount_currently_past_due', 'num_inq_12_month', 'num_card_inq_24_month', 'num_card_12_month', 'num_auto_ 36_month', 'uti_open_card', 'pct_over_50_uti', 'uti_max_credit_line', 'pct_card_over_50_uti', 'ind_XYZ', 'rep_income', 'rep_education']
labels=list(train_df2.columns[0:20]);
train_X=train_df2[labels]; train_Y=train_df2['Def_ind'];
test_X=test_df2[labels]; test_Y=test_df2['Def_ind'];
LR=LogisticRegression()
clf=LR.fit(train_X,train_Y)
prob_d=clf.predict_proba(test_X);
pred=clf.predict(test_X);
score=clf.score(test_X,test_Y);
print(pd.DataFrame(np.array([prob_d[:,0].T,prob_d[:,1].T,test_Y.T,pred]).T,
columns=['prob_Def_Ind=0','prob_Def_Ind=1','Def_ind',"pred"]));
print(" ");
print("Beta_Coefficients from Logistic Regression model:");
print(pd.DataFrame(np.array(clf.coef_),
columns=list(train_df2.columns[0:20])).transpose());
print(" ");
print("Accuracy of the trained model was "+str(score));
cm=metrics.confusion_matrix(test_Y, pred);
plt.figure(figsize=(7,7));
seaborn.heatmap(cm,annot=True,fmt=".3f",
linewidths=.5,square=True,cmap='Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title='Accuracy Score:{0}'.format(score);
plt.title(all_sample_title,size=15);
print(" ")
recall=cm[0,0]/(cm[0,0]+cm[1,0]);
precision=cm[0,0]/(cm[0,0]+cm[0,1]);
accuracy=(cm[0,0]+cm[1,1])/(cm[0,0]+cm[0,1]+cm[1,0]+cm[1,1]);
F_measure=(2*recall*precision)/(recall+precision);
print("Recall: "+str(recall));
print("Precision: "+str(precision));
print("Accuracy: "+str(accuracy));
print("F-measure: "+str(F_measure));
logit_roc_auc=roc_auc_score(test_Y,clf.predict(test_X));
fpr,tpr,thresholds=roc_curve(test_Y, clf.predict_proba(test_X)[:,1]);
plt.figure();
plt.plot(fpr, tpr,label='Logistic Regression (area = %0.2f)'%logit_roc_auc);
plt.plot([0,1],[0,1],'r--');
plt.xlim([0.0,1.0]);
plt.ylim([0.0,1.05]);
plt.xlabel('False Positive Rate');
plt.ylabel('True Positive Rate');
plt.title('Receiver operating characteristic');
plt.legend(loc="lower right");
plt.savefig('Log_ROC');
plt.show();
prob_Def_Ind=0 prob_Def_Ind=1 Def_ind pred
0 0.863948 0.136052 0.0 0.0
1 0.881070 0.118930 0.0 0.0
2 0.058506 0.941494 0.0 1.0
3 0.950192 0.049808 0.0 0.0
4 0.899325 0.100675 0.0 0.0
... ... ... ... ...
4128 0.870301 0.129699 0.0 0.0
4129 0.912065 0.087935 0.0 0.0
4130 0.911367 0.088633 0.0 0.0
4131 0.948672 0.051328 0.0 0.0
4132 0.921004 0.078996 0.0 0.0
[4133 rows x 4 columns]
Beta_Coefficients from Logistic Regression model:
0
tot_balance -0.000001
avg_bal_cards -0.000115
credit_age -0.005132
credit_age_good_account 0.000357
credit_card_age 0.000814
num_acc_30d_past_due_12_months 0.000283
num_acc_30d_past_due_6_months 0.000061
num_mortgage_currently_past_due 0.000064
tot_amount_currently_past_due 0.000266
num_inq_12_month 0.001132
num_card_inq_24_month 0.001739
num_card_12_month 0.000109
num_auto_ 36_month 0.000022
uti_open_card 0.000214
pct_over_50_uti 0.000167
uti_max_credit_line 0.000166
pct_card_over_50_uti 0.000181
ind_XYZ -0.000130
rep_income 0.000001
rep_education -0.000064
Accuracy of the trained model was 0.8826518267602226
Recall: 0.9195342820181113
Precision: 0.9533261802575107
Accuracy: 0.8826518267602226
F-measure: 0.9361253786382194
The trained logistic regression model (LR model) was tested on test_X and test_Y, and it correctly predicted roughly $88.3\%$ of test accounts, where the target indicates whether an account defaulted on its credit card within 18 months of being opened by bank XYZ. To further ground this result, we computed the confusion matrix to calculate the recall, precision, accuracy and F-measure of the trained LR model. Recall was approximately $92.0\%$, meaning the trained LR model correctly identified $92.0\%$ of the true positive cases. Precision was $95.3\%$, meaning that of all the cases predicted as true, $95.3\%$ were actually true. As attained by clf.score(), the accuracy of the trained LR model was $88.3\%$, which simply means the trained LR model correctly predicted $88.3\%$ of the test accounts.
As we must rank the trained LR model against a trained random forest model, we also computed the F-measure and ROC curve [10]; the resulting F-measure was on the high end of the scale, which is desired. As for the ROC curve, it is desirable that the true positive rate curve remain as far above the forty-five degree diagonal as possible. The F-measure of the trained LR model was $0.9361$, which would be considered a good model; though we can still improve the LR model to attain higher accuracy.
We believed the trained LR model was overfitted to the training dataset, train_df2, and hence required regularization to help gradient descent get closer to the minimum of the loss function. We remind the reader that gradient descent only reaches the minimum in $O(1/\epsilon)$ steps, and prior to completing those steps it is sufficiently close to the minimum but not at it.
Regularization of a neural network can take the form of L1-regularization, which adds an L1-norm term proportioned by $\frac{\lambda}{2n}$ to the cost function; L2-regularization, which adds an L2-norm term proportioned by $\frac{\lambda}{2n}$ to the cost function; or the dropout technique, which assigns a certain probability that an element of the weight matrix is dropped to zero. In our case we chose L2-regularization, as it has a greater effect than L1-regularization at reducing variance in a trained neural network (i.e., trained model). Recalling the LR loss function, we convert it into a cost function by averaging (and negating, since we minimize), and then add the L2-regularization term to attain the regularized cost function,
$$Cost(\beta)= -\frac{1}{n}\sum_{i=1}^{n} \Big[ y_{i}\ln(\sigma(\beta^{T}x_{i}))+(1-y_{i})\ln(1-\sigma(\beta^{T}x_{i})) \Big] +\frac{\lambda}{2n}\Vert \beta \Vert_{2}^{2}.$$
In simple terms, L1 and L2 regularization are an additional dial, similar to the learning rate, for adjusting the gradient magnitude to conform to Theorem 6.2 from [9].
Theorem 6.2. Suppose the function $f : \R^{n} \to \R$ is convex and differentiable, and that its gradient is Lipschitz continuous with constant $L > 0$, i.e., we have that $\Vert \nabla f(x) - \nabla f(y) \Vert_{2} \leq L \Vert x - y \Vert_{2}$ for any $x$, $y$. Then if we [run] gradient descent for $k$ iterations with step size $t_{i}$ chosen using backtracking line search on each iteration, it will yield a solution $f^{(k)}$ which satisfies
$$f(x^{(k)})-f(x^{*}) \leq \frac{\Vert x^{(0)} - x^{*} \Vert_{2}^{2}}{2 t_{\min}k}$$
where $t_{\min}=\min\{1,\beta/L\}$.
In simple terms, we want to select a $\lambda$ value that adjusts the gradient so as to get closer to the minimum, or in other words, attain a better approximation of the optimal beta weight vector (i.e., the minimum).
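A minimal sketch of the regularized cost (the negative average log-likelihood plus the $\frac{\lambda}{2n}\Vert\beta\Vert_{2}^{2}$ penalty), evaluated on hypothetical toy inputs rather than the report's data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost(beta, X, y, lam):
    """Negative average log-likelihood plus the L2 penalty (lam / 2n) * ||beta||^2."""
    n = len(y)
    p = sigmoid(X @ beta)
    nll = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return nll + (lam / (2 * n)) * np.sum(beta ** 2)

# Hypothetical toy inputs to show the penalty's effect.
X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0], [-1.0, 1.5]])
y = np.array([1, 0, 1, 0])
beta = np.array([0.5, -0.25])

# Larger lambda penalizes the same nonzero beta more heavily, shrinking the optimum.
print(regularized_cost(beta, X, y, lam=0.0) < regularized_cost(beta, X, y, lam=10.0))  # True
```

Since the penalty grows with $\Vert\beta\Vert_{2}^{2}$, minimizing this cost pulls the weights toward zero, which is the variance-reduction effect exploited below.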
Implementing L2-regularization is rather simple and only requires passing the parameters penalty="l2", solver="liblinear", tol=1e-6, max_iter=int(1e6), warm_start=True, intercept_scaling=10000.0 into sklearn.linear_model.LogisticRegression. Lastly, we performed the same model validation process as before.
LR=LogisticRegression(penalty="l2",solver="liblinear",tol=1e-6,
max_iter=int(1e6),warm_start=True,intercept_scaling=10000.0);
clf=LR.fit(train_X,train_Y);
prob_d=clf.predict_proba(test_X);
pred=clf.predict(test_X);
score=clf.score(test_X,test_Y);
print(pd.DataFrame(np.array([prob_d[:,0].T,prob_d[:,1].T,test_Y.T,pred]).T,
columns=['prob_Def_Ind=0','prob_Def_Ind=1','Def_ind',"pred"]));
print(" ");
print("Beta_Coefficients from Reg. Logistic Regression model:");
print(pd.DataFrame(np.array(clf.coef_),
columns=list(train_df2.columns[0:20])).transpose());
print(" ");
print("Accuracy of the trained model was "+str(score));
cm=metrics.confusion_matrix(test_Y,pred);
plt.figure(figsize=(7,7));
seaborn.heatmap(cm, annot=True,fmt=".3f",
linewidths=.5,square=True,cmap='Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title='Accuracy Score:{0}'.format(score);
plt.title(all_sample_title,size=15);
print(" ");
recall=cm[0,0]/(cm[0,0]+cm[1,0]);
precision=cm[0,0]/(cm[0,0]+cm[0,1]);
accuracy=(cm[0,0]+cm[1,1])/(cm[0,0]+cm[0,1]+cm[1,0]+cm[1,1]);
F_measure=(2*recall*precision)/(recall+precision);
print("Recall: "+str(recall));
print("Precision: "+str(precision));
print("Accuracy: "+str(accuracy));
print("F-measure: "+str(F_measure));
logit_roc_auc=roc_auc_score(test_Y,clf.predict(test_X));
fpr,tpr,thresholds=roc_curve(test_Y,clf.predict_proba(test_X)[:,1]);
plt.figure();
plt.plot(fpr, tpr,label='Logistic Regression (area = %0.2f)'%logit_roc_auc);
plt.plot([0,1],[0,1],'r--');
plt.xlim([0.0,1.0]);
plt.ylim([0.0,1.05]);
plt.xlabel('False Positive Rate');
plt.ylabel('True Positive Rate');
plt.title('Receiver operating characteristic');
plt.legend(loc="lower right");
plt.savefig('Log_ROC');
plt.show();
prob_Def_Ind=0 prob_Def_Ind=1 Def_ind pred
0 0.933852 0.066148 0.0 0.0
1 0.761364 0.238636 0.0 0.0
2 0.712649 0.287351 0.0 0.0
3 0.959617 0.040383 0.0 0.0
4 0.935899 0.064101 0.0 0.0
... ... ... ... ...
4128 0.846793 0.153207 0.0 0.0
4129 0.935317 0.064683 0.0 0.0
4130 0.752892 0.247108 0.0 0.0
4131 0.967526 0.032474 0.0 0.0
4132 0.945871 0.054129 0.0 0.0
[4133 rows x 4 columns]
Beta_Coefficients from Reg. Logistic Regression model:
0
tot_balance -1.567375e-06
avg_bal_cards -1.214146e-04
credit_age -4.323569e-03
credit_age_good_account 5.987032e-04
credit_card_age -4.973357e-04
num_acc_30d_past_due_12_months 9.623256e-01
num_acc_30d_past_due_6_months 2.252230e-01
num_mortgage_currently_past_due 2.290428e-01
tot_amount_currently_past_due 1.910164e-05
num_inq_12_month 3.955467e-01
num_card_inq_24_month -7.654231e-02
num_card_12_month 1.820608e-01
num_auto_ 36_month 3.647885e-02
uti_open_card 8.876689e-01
pct_over_50_uti 6.728455e-01
uti_max_credit_line 6.641946e-01
pct_card_over_50_uti 7.199934e-01
ind_XYZ -3.106927e-01
rep_income 6.300817e-07
rep_education -9.712513e-02
Accuracy of the trained model was 0.9044277764335834
Recall: 0.92372234935164
Precision: 0.9745171673819742
Accuracy: 0.9044277764335834
F-measure: 0.9484401514162641
The regularized trained logistic regression model (reg LR model) was tested on test_X and test_Y, and it correctly predicted roughly $90.4\%$ of test accounts, where the target indicates whether an account defaulted on its credit card within 18 months of being opened by bank XYZ. To further ground this result, we computed the confusion matrix to calculate the recall, precision, accuracy and F-measure of the trained reg LR model. Recall was approximately $92.4\%$, meaning the trained reg LR model correctly identified $92.4\%$ of the true positive cases. Precision was $97.5\%$, meaning that of all the cases predicted as true, $97.5\%$ were actually true. As attained by clf.score(), the accuracy of the trained reg LR model was $90.4\%$, which simply means the model correctly predicted $90.4\%$ of the test accounts. This is a significant improvement over the non-regularized LR model, which further confirms that the non-regularized LR model was slightly overfitted; the L2 regularization reduced variance in the model enough to improve its accuracy.
Section 3: Random Forest
The random forest algorithm is a classifier based on a combination of tree predictors [11], which are sampled (via pseudo-random numbers) from independent and identically distributed random variables, as stated in Definition 1.1 of [11].
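To illustrate the "combination of tree predictors" idea in miniature (a hypothetical sketch of the mechanism, not sklearn's internals), one can bootstrap-sample the training data, fit one decision tree per sample, and predict by majority vote; sklearn's RandomForestClassifier additionally randomizes the features considered at each split:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def forest_predict(X_train, y_train, X_new, n_trees=25, seed=0):
    """Bagged decision trees with majority voting (the core random forest idea)."""
    rng = np.random.default_rng(seed)
    votes = np.zeros((n_trees, len(X_new)))
    for t in range(n_trees):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # bootstrap sample
        tree = DecisionTreeClassifier(random_state=t).fit(X_train[idx], y_train[idx])
        votes[t] = tree.predict(X_new)
    return (votes.mean(axis=0) >= 0.5).astype(int)              # majority vote

# Hypothetical one-feature data: class 0 clusters near 0, class 1 near 12.
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(forest_predict(X_train, y_train, np.array([[1.5], [11.5]])))  # -> [0 1]
```

Averaging many decorrelated trees is what reduces the variance of any single deep tree, which is why the ensemble tends to generalize better than its members.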
To implement sklearn.ensemble.RandomForestClassifier, we standardized each predictor feature and the target variable to a standard normal distribution; we accomplished this through StandardScaler() and fit_transform(). We completed this process for both train_df2 and test_df2. With the standardized data, ss_X and ss_Y, we split the standardized dataframe into train and test predictor feature sets and train and test target variable sets. We were then able to train and test the random forest model. Lastly, we conducted the same validation process as we conducted for the LR model.
from sklearn.model_selection import train_test_split
ss_X_df.columns[0:20]
Index(['tot_balance', 'avg_bal_cards', 'credit_age', 'credit_age_good_account',
'credit_card_age', 'num_acc_30d_past_due_12_months',
'num_acc_30d_past_due_6_months', 'num_mortgage_currently_past_due',
'tot_amount_currently_past_due', 'num_inq_12_month',
'num_card_inq_24_month', 'num_card_12_month', 'num_auto_ 36_month',
'uti_open_card', 'pct_over_50_uti', 'uti_max_credit_line',
'pct_card_over_50_uti', 'ind_XYZ', 'rep_income', 'rep_education'],
dtype='object')
X_test
| tot_balance | avg_bal_cards | credit_age | credit_age_good_account | credit_card_age | num_acc_30d_past_due_12_months | num_acc_30d_past_due_6_months | num_mortgage_currently_past_due | tot_amount_currently_past_due | num_inq_12_month | ... | num_card_12_month | num_auto_ 36_month | uti_open_card | pct_over_50_uti | uti_max_credit_line | pct_card_over_50_uti | ind_XYZ | rep_income | rep_education | Def_ind | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14056 | 0.667604 | 0.059683 | 0.492772 | 0.721140 | 0.382498 | -0.333932 | -0.167036 | -0.175574 | -0.197555 | -0.531987 | ... | -0.559488 | -0.436351 | -0.420380 | 0.600560 | -0.420092 | -1.108982 | 1.738244 | 0.468127 | 0.252436 | -0.336847 |
| 3466 | 0.010697 | -0.043129 | -1.580245 | -2.437935 | -1.838257 | 1.781311 | -0.167036 | -0.175574 | 1.360689 | 0.335455 | ... | -0.559488 | -0.436351 | 1.235583 | 1.274821 | 2.250931 | 1.533553 | 1.738244 | 0.841237 | -1.283487 | -0.336847 |
| 9307 | -0.200614 | -0.721324 | 1.502004 | 1.031869 | 1.407462 | -0.333932 | -0.167036 | -0.175574 | -0.197555 | 1.202897 | ... | -0.559488 | 2.208918 | -1.471210 | -1.054665 | -0.342764 | -0.801214 | 1.738244 | 0.944058 | 0.252436 | -0.336847 |
| 9338 | 0.758774 | 0.908916 | -0.530098 | -0.547669 | -0.362930 | -0.333932 | -0.167036 | -0.175574 | -0.197555 | 0.335455 | ... | -0.559488 | -0.436351 | 0.234652 | 0.128746 | 0.190185 | 0.627270 | -0.575293 | 0.512623 | -1.283487 | -0.336847 |
| 18813 | -0.136315 | 0.550532 | 0.656431 | -0.159258 | 0.941570 | -0.333932 | -0.167036 | -0.175574 | -0.197555 | -0.531987 | ... | -0.559488 | -0.436351 | -1.625749 | -2.048498 | -0.921564 | -2.090365 | -0.575293 | 0.617083 | 0.252436 | -0.336847 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13636 | 2.168404 | 1.744654 | 0.015433 | -0.392304 | 0.366969 | -0.333932 | -0.167036 | -0.175574 | -0.197555 | -0.531987 | ... | -0.559488 | -0.436351 | -0.362622 | 0.722125 | 0.002852 | 0.079278 | -0.575293 | 0.630887 | -1.283487 | -0.336847 |
| 6996 | -0.391903 | -0.317970 | -0.584651 | -0.936080 | -1.310245 | 1.781311 | -0.167036 | -0.175574 | -0.197555 | -0.531987 | ... | -0.559488 | -0.436351 | 0.632993 | 0.964332 | 0.341807 | 0.067612 | 1.738244 | -0.498523 | 0.252436 | -0.336847 |
| 5805 | 1.040378 | 1.173876 | 1.815685 | 1.420280 | 1.190046 | 1.781311 | -0.167036 | -0.175574 | -0.197555 | -0.531987 | ... | -0.559488 | 2.208918 | 0.736162 | 0.196673 | 1.353546 | -0.066670 | 1.738244 | -0.539789 | 1.788360 | -0.336847 |
| 11259 | 0.711189 | 1.261833 | -0.216418 | -1.117338 | -0.735644 | 3.896554 | 5.521415 | 5.695591 | 4.375447 | -0.531987 | ... | -0.559488 | -0.436351 | -0.426881 | -1.330365 | -0.939289 | -0.874243 | -0.575293 | -0.341298 | 1.788360 | -0.336847 |
| 16995 | 1.064149 | 1.178914 | -1.743905 | -1.013762 | -2.009084 | -0.333932 | -0.167036 | -0.175574 | -0.197555 | -0.531987 | ... | 1.495466 | -0.436351 | 1.693768 | 0.282412 | 1.400116 | 0.542642 | -0.575293 | 2.287862 | 0.252436 | -0.336847 |
4996 rows × 21 columns
ss_X=StandardScaler().fit_transform(train_df2);   # standardized training data
ss_Y=StandardScaler().fit_transform(test_df2);    # standardized test data
ss_X_df=pd.DataFrame(ss_X, columns=train_df2.columns,index=train_df2.index);
ss_Y_df=pd.DataFrame(ss_Y, columns=test_df2.columns,index=test_df2.index);
labels=list(ss_X_df.columns[0:20]);
# Note: the train_test_split below supersedes these two assignments,
# carving both the train and test sets out of the standardized training data.
X_train=ss_X_df[labels]; y_train=ss_X_df['Def_ind'];
X_test=ss_Y_df[labels]; y_test=ss_Y_df['Def_ind'];
X_train,X_test,y_train,y_test=train_test_split(ss_X_df[labels],train_Y,
                                               test_size=0.3,random_state=0);
classifier=RandomForestClassifier(n_estimators=100);
print("classifer: RandomForestClassifier");
classifier=classifier.fit(X_train,y_train);
predicted=classifier.predict(X_test);
score2=classifier.score(X_test,y_test);
print(pd.concat([y_test,pd.Series(predicted,index=y_test.index,
name='predicted')], axis=1));
print(" ");
print(classifier.predict_proba(X_test));
print(" ");
print("Accounts that Defaulted: "+str(sum(y_test))+" "+
      "Accounts predicted to Default: "+str(sum(predicted)));
print("Accuracy: ", accuracy_score(y_test, predicted));
print(" ");
print("Decision path of Random Forest algorithm:")
print(classifier.decision_path(X_test));
clf2=DecisionTreeClassifier(max_depth=2,random_state=0);
clf2=clf2.fit(X_test,y_test);
tree.plot_tree(clf2);
cm=metrics.confusion_matrix(y_test,predicted);
plt.figure(figsize=(7,7));
seaborn.heatmap(cm, annot=True,fmt=".3f",
linewidths=.5,square=True,cmap='Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title='Accuracy Score:{0}'.format(score2);
plt.title(all_sample_title,size=15);
print(" ");
recall=cm[0,0]/(cm[0,0]+cm[1,0]);
precision=cm[0,0]/(cm[0,0]+cm[0,1]);
accuracy=(cm[0,0]+cm[1,1])/(cm[0,0]+cm[0,1]+cm[1,0]+cm[1,1]);
F_measure=(2*recall*precision)/(recall+precision);
print("Recall: "+str(recall));
print("Precision: "+str(precision));
print("Accuracy: "+str(accuracy));
print("F-measure: "+str(F_measure));
rf_roc_auc=roc_auc_score(y_test,classifier.predict(X_test));
fpr,tpr,thresholds=roc_curve(y_test,classifier.predict_proba(X_test)[:,1]);
plt.figure();
plt.plot(fpr, tpr,label='Random Forest (RF) (area = %0.2f)'%rf_roc_auc);
plt.plot([0,1],[0,1],'r--');
plt.xlim([0.0,1.0]);
plt.ylim([0.0,1.05]);
plt.xlabel('False Positive Rate');
plt.ylabel('True Positive Rate');
plt.title('Receiver operating characteristic');
plt.legend(loc="lower right");
plt.savefig('RF_ROC');
plt.show();
classifer: RandomForestClassifier
Def_ind predicted
14056 0 0
3466 0 0
9307 0 0
9338 0 0
18813 0 0
... ... ...
13636 0 0
6996 0 0
5805 0 0
11259 0 0
16995 0 0
[4996 rows x 2 columns]
[[0.99 0.01]
[0.79 0.21]
[0.99 0.01]
...
[0.9 0.1 ]
[0.81 0.19]
[0.71 0.29]]
Accounts that Defaulted: 494 Accounts predicted to Default: 157
Accuracy: 0.91693354683747
Decision path of Random Forest algorithm:
(<4996x181700 sparse matrix of type '<class 'numpy.int64'>'
with 8461917 stored elements in Compressed Sparse Row format>, array([ 0, 1921, 3840, 5673, 7506, 9389, 11196, 13059,
14886, 16679, 18436, 20185, 22054, 23889, 25740, 27505,
29396, 31155, 32944, 34729, 36616, 38507, 40390, 42221,
44098, 45851, 47666, 49493, 51286, 53165, 55000, 56787,
58572, 60451, 62210, 63971, 65786, 67625, 69462, 71229,
73070, 74913, 76666, 78401, 80246, 82195, 83992, 85849,
87636, 89501, 91300, 93111, 94850, 96639, 98436, 100215,
102050, 103787, 105560, 107389, 109192, 110943, 112826, 114637,
116490, 118303, 120176, 121949, 123756, 125605, 127432, 129271,
131120, 132901, 134724, 136481, 138288, 140051, 141878, 143667,
145552, 147277, 149090, 150903, 152722, 154511, 156354, 158231,
160060, 161827, 163626, 165457, 167234, 168985, 170840, 172631,
174488, 176281, 178072, 179941, 181700], dtype=int32))
Recall: 0.9222979954536061
Precision: 0.9913371834740116
Accuracy: 0.91693354683747
F-measure: 0.9555722085429824
Overall, the predictive capacity of the random forest algorithm was slightly better than that of the regularized logistic regression. The slight improvement in overall accuracy of the random forest resulted in a small increase in the F-measure, but a negligible increase in the ROC curve and thus the area captured underneath it.
The way to regularize a random forest is by pruning the decision trees, where branch decisions are cut from the tree. In our case, however, the decision tree lacks depth and hence cannot be pruned, as all decision branches are vital to the tree. In effect, there is no means to regularize our model, because no branches can be pruned to reduce any possible overfitting.
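For completeness, had the trees been deep enough to warrant it, sklearn's RandomForestClassifier exposes pruning-style regularizers: max_depth caps tree growth up front, and ccp_alpha applies minimal cost-complexity pruning after growth. A hypothetical sketch on synthetic stand-in data (not the bank XYZ dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, purely illustrative.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Unconstrained forest versus one regularized by depth capping and
# minimal cost-complexity pruning (ccp_alpha).
full = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
pruned = RandomForestClassifier(n_estimators=100, max_depth=5, ccp_alpha=1e-3,
                                random_state=0).fit(X_tr, y_tr)
print(full.score(X_te, y_te), pruned.score(X_te, y_te))
```

Whether the constrained forest wins depends on how overfit the unconstrained one is; on shallow trees such as ours the two typically perform alike, consistent with the observation above.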
Section 4: Compare Results and Select Model
From our analysis we observed that, for this collection of data, the logistic regression classifier was slightly more prone to overfitting the training set than the random forest classifier. The overfitting was somewhat mitigated with L2-regularization, which improved the overall performance of the logistic regression classifier.
Observing the F-measure and ROC graph, it is evident that either algorithm would be suitable as a model to predict the likelihood of a future default by a credit card applicant. The predictive capacity of the random forest classifier is only slightly greater than that of the logistic regression for this dataset and problem. Consequently, we would select random forest as the model to review credit card applications, because it possessed the better F-measure.
Section 5: How Does the Model Improve Decision Making?
As no model is one hundred percent correct one hundred percent of the time, it is important to validate the prediction results of the random forest model. In essence, the applicant data would be formatted into a readable csv file, imported into a Python environment, and fed into the model to return its prediction results, which would then be added to the applicant dataset.
In a real-world application, we would not possess a training set against which to compare the results of the random forest model; we would have to rely on the correlative relationships between the predictor features and the target variable, Def_ind. We remind the reader that in Section 1, through the correlation matrix, we discerned that the following features possessed a weak positive relationship with the target variable, Def_ind:
num_acc_30d_past_due_12_months, num_acc_30d_past_due_6_months, num_mortgage_currently_past_due, tot_amount_currently_past_due and num_card_12_month.
The applicant dataset could thus be reduced to the accounts that the random forest model predicts will default on their credit card; call this the default dataset. One would then discern whether any of these accounts possess one or more of the weak-positive-relationship features; if they do, it would corroborate the model's prediction, as possessing one of these predictor features increases the probability of default.
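A hypothetical sketch of this validation step, where `applicants` (the scored applicant dataframe) and `preds` (the random forest's 0/1 predictions) are assumed inputs; the column names follow the report:

```python
import pandas as pd

WEAK_POSITIVE = ['num_acc_30d_past_due_12_months', 'num_acc_30d_past_due_6_months',
                 'num_mortgage_currently_past_due', 'tot_amount_currently_past_due',
                 'num_card_12_month']

def build_default_dataset(applicants: pd.DataFrame, preds) -> pd.DataFrame:
    """Reduce to predicted defaults and flag weak-positive-feature support."""
    mask = pd.Series(preds, index=applicants.index) == 1
    default_df = applicants.loc[mask].copy()
    # A prediction is corroborated if any weak-positive feature is nonzero.
    default_df['supported'] = (default_df[WEAK_POSITIVE] > 0).any(axis=1)
    return default_df

# Tiny hypothetical example: row 0 has supporting features, row 2 has none,
# row 1 is not a predicted default and is dropped.
demo = pd.DataFrame({c: [1, 0, 0] for c in WEAK_POSITIVE})
out = build_default_dataset(demo, [1, 0, 1])
print(out['supported'].tolist())               # -> [True, False]
```

Unsupported predictions (the `False` rows) would be the natural candidates for manual review before a decline decision is made.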
References:
The external links are also references. Click the linked numbers to be directed to the referenced source.
- [1]: Stackoverflow: How do I set custom CSS for my IPython/IHaskell/Jupyter Notebook?, https://stackoverflow.com/questions/32156248/how-do-i-set-custom-css-for-my-ipython-ihaskell-jupyter-notebook
- Wan, Shuyan, et al. "Model Selection for Credit Card Approval." Ohio State University, https://www.asc.ohio-state.edu/goel.1/STATLEARN/PROJECTS/Presentations/CreditCardApproval.pdf.